Content-Based Quality Estimation for Automatic Subject Indexing of Short Texts under Precision and Recall Constraints
Semantic annotations have to satisfy quality constraints to be useful for
digital libraries, which is particularly challenging on large and diverse
datasets. Confidence scores of multi-label classification methods typically
refer only to the relevance of particular subjects, disregarding indicators of
insufficient content representation at the document-level. Therefore, we
propose a novel approach that detects documents rather than concepts where
quality criteria are met. Our approach uses a deep, multi-layered regression
architecture, which comprises a variety of content-based indicators. We
evaluated multiple configurations using text collections from law and
economics, where the available content is restricted to very short texts.
Notably, we demonstrate that the proposed quality estimation technique can
determine subsets of the previously unseen data where considerable gains in
document-level recall can be achieved, while upholding precision at the same
time. Hence, the approach effectively performs a filtering that ensures high
data quality standards in operative information retrieval systems.
Comment: authors' manuscript, paper submitted to the TPDL-2018 conference, 12 pages
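The filtering step the abstract describes can be sketched as follows: a regression model maps content-based indicators of each document to a predicted quality score, and only documents above a threshold pass the filter. The indicator names, weights, and linear scoring function below are hypothetical stand-ins for illustration; the paper itself uses a deep, multi-layered regression architecture.

```python
def quality_score(doc, weights):
    """Linear stand-in for the paper's regression model: weighted sum
    of content-based indicators (each normalised to [0, 1])."""
    return sum(weights[k] * doc[k] for k in weights)

def filter_documents(docs, weights, threshold):
    """Keep only documents whose predicted quality meets the threshold,
    so recall/precision constraints hold on the retained subset."""
    return [d for d in docs if quality_score(d, weights) >= threshold]

# Hypothetical content-based indicators: title length and term overlap
# with a controlled vocabulary.
weights = {"title_length": 0.4, "vocab_overlap": 0.6}
docs = [
    {"id": "a", "title_length": 0.9, "vocab_overlap": 0.8},
    {"id": "b", "title_length": 0.2, "vocab_overlap": 0.1},
]
kept = filter_documents(docs, weights, threshold=0.5)
print([d["id"] for d in kept])  # ['a'] -- only the well-represented document survives
```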
Exploiting Anti-monotonicity of Multi-label Evaluation Measures for Inducing Multi-label Rules
Exploiting dependencies between labels is considered to be crucial for
multi-label classification. Rules are able to expose label dependencies such as
implications, subsumptions or exclusions in a human-comprehensible and
interpretable manner. However, the induction of rules with multiple labels in
the head is particularly challenging, as the number of label combinations which
must be taken into account for each rule grows exponentially with the number of
available labels. To overcome this limitation, algorithms for exhaustive rule
mining typically use properties such as anti-monotonicity or decomposability in
order to prune the search space. In the present paper, we examine whether
commonly used multi-label evaluation metrics satisfy these properties and
therefore are suited to prune the search space for multi-label heads.
Comment: Preprint version. To appear in: Proceedings of the Pacific-Asia
Conference on Knowledge Discovery and Data Mining (PAKDD) 2018. See
http://www.ke.tu-darmstadt.de/bibtex/publications/show/3074 for further
information. arXiv admin note: text overlap with arXiv:1812.0005
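The pruning principle the abstract refers to can be illustrated with support (the fraction of examples in which every label of a head is relevant), a classic anti-monotone measure: adding a label to a head can only lower it, so once a head falls below a minimum threshold, all of its supersets can be pruned. This is a generic illustration of anti-monotone pruning, not the paper's algorithm; the paper examines whether common multi-label evaluation measures admit the same treatment.

```python
from itertools import count  # stdlib only; count is unused but kept minimal

def support(head, examples):
    """Fraction of examples in which all labels of the head are relevant."""
    hits = sum(1 for labels in examples if head <= labels)
    return hits / len(examples)

def search_heads(label_set, examples, min_support):
    """Level-wise enumeration of multi-label heads with anti-monotone
    pruning: heads below min_support are never extended, because no
    superset can recover a higher support."""
    frequent = []
    candidates = [frozenset([l]) for l in sorted(label_set)]
    while candidates:
        survivors = [h for h in candidates if support(h, examples) >= min_support]
        frequent.extend(survivors)
        next_level = set()
        for h in survivors:  # only surviving heads are extended
            for l in label_set - h:
                next_level.add(h | frozenset([l]))
        candidates = list(next_level)
    return frequent

examples = [frozenset("AB"), frozenset("AB"), frozenset("A"), frozenset("C")]
heads = search_heads({"A", "B", "C"}, examples, min_support=0.5)
print(sorted("".join(sorted(h)) for h in heads))  # ['A', 'AB', 'B']
```

Head {C} has support 0.25 and is pruned immediately, so the exponential space of supersets containing C is never evaluated exhaustively.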
Learning Interpretable Rules for Multi-label Classification
Multi-label classification (MLC) is a supervised learning problem in which,
contrary to standard multiclass classification, an instance can be associated
with several class labels simultaneously. In this chapter, we advocate a
rule-based approach to multi-label classification. Rule learning algorithms are
often employed when one is not only interested in accurate predictions, but
also requires an interpretable theory that can be understood, analyzed, and
qualitatively evaluated by domain experts. Ideally, by revealing patterns and
regularities contained in the data, a rule-based theory yields new insights in
the application domain. Recently, several authors have started to investigate
how rule-based models can be used for modeling multi-label data. Discussing
this task in detail, we highlight some of the problems that make rule learning
considerably more challenging for MLC than for conventional classification.
While mainly focusing on our own previous work, we also provide a short
overview of related work in this area.
Comment: Preprint version. To appear in: Explainable and Interpretable Models
in Computer Vision and Machine Learning. The Springer Series on Challenges in
Machine Learning. Springer (2018). See
http://www.ke.tu-darmstadt.de/bibtex/publications/show/3077 for further
information.
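The interpretability argument can be made concrete with a toy rule-based multi-label model: rule bodies may test both input features and previously predicted labels, so label dependencies such as implications become explicit, human-readable rules. The feature and label names below are invented for illustration and do not come from the chapter.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    body: dict   # conditions on features (and possibly on labels)
    head: set    # labels predicted when the body fires

    def fires(self, state):
        return all(state.get(k) == v for k, v in self.body.items())

def apply_rules(rules, instance):
    """Iteratively apply rules until a fixpoint: predicted labels are
    added to the state, so they can trigger label-dependency rules."""
    predicted = set()
    changed = True
    while changed:
        changed = False
        for r in rules:
            state = dict(instance, **{l: True for l in predicted})
            if r.fires(state) and not r.head <= predicted:
                predicted |= r.head
                changed = True
    return predicted

rules = [
    Rule(body={"contains_goal": True}, head={"sports"}),
    # An explicit label implication: every "sports" document is also "news".
    Rule(body={"sports": True}, head={"news"}),
]
print(sorted(apply_rules(rules, {"contains_goal": True})))  # ['news', 'sports']
```

Each rule can be read off directly by a domain expert, which is exactly the kind of qualitative evaluation the chapter advocates.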
Multi-Target Prediction: A Unifying View on Problems and Methods
Multi-target prediction (MTP) is concerned with the simultaneous prediction
of multiple target variables of diverse type. Due to its enormous application
potential, it has developed into an active and rapidly expanding research field
that combines several subfields of machine learning, including multivariate
regression, multi-label classification, multi-task learning, dyadic prediction,
zero-shot learning, network inference, and matrix completion. In this paper, we
present a unifying view on MTP problems and methods. First, we formally discuss
commonalities and differences between existing MTP problems. To this end, we
introduce a general framework that covers the above subfields as special cases.
As a second contribution, we provide a structured overview of MTP methods. This
is accomplished by identifying a number of key properties, which distinguish
such methods and determine their suitability for different types of problems.
Finally, we also discuss a few challenges for future research.
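The unifying view can be sketched through the shared interface of MTP problems: every problem supplies an n x m matrix Y of targets for n instances and m target variables, and the simplest baseline fits one independent model per target column (binary relevance for label matrices, independent regressors for numeric targets). The column-mean "model" below is a deliberately trivial placeholder, not a method from the paper.

```python
def fit_per_target(Y):
    """Fit one independent constant model (the column mean) per target
    column of the n x m target matrix Y."""
    n, m = len(Y), len(Y[0])
    return [sum(row[j] for row in Y) / n for j in range(m)]

# Multi-label data (0/1 relevance) and multivariate regression data
# are both instances of the same target-matrix interface:
Y_labels = [[1, 0], [1, 1], [0, 0], [1, 0]]
Y_numeric = [[2.0, 10.0], [4.0, 30.0]]
print(fit_per_target(Y_labels))   # [0.75, 0.25]
print(fit_per_target(Y_numeric))  # [3.0, 20.0]
```

Methods then differ in whether, and how, they exploit dependencies between the columns of Y, which is one of the key properties the paper uses to structure its overview.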
Scalable Text Classification with Sparse Generative Modeling
Machine learning technology faces challenges in handling "Big Data": vast
volumes of online data such as web pages, news stories and articles. A
dominant solution has been parallelization, but this does not make the tasks
less challenging. An alternative solution is using sparse computation methods
to fundamentally change the complexity of the processing tasks themselves.
This can be done by using both the sparsity found in natural data and
sparsified models. In this paper we show that sparse representations can be
used to reduce the time complexity of generative classifiers to build
fundamentally more scalable classifiers. We reduce the time complexity of
Multinomial Naive Bayes classification with sparsity and show how to extend
these findings into three multi-label extensions: Binary Relevance, Label
Powerset and Multi-label Mixture Models. To provide competitive performance we
provide the methods with smoothing and pruning modifications and optimize
model meta-parameters using direct search optimization. We report on
classification experiments on 5 publicly available datasets for large-scale
multi-label classification. All three methods scale easily to the largest
available tasks, with training times measured in seconds and classification
times in milliseconds, even with millions of training documents, features and
classes. The presented sparse modeling techniques should be applicable to many
other classifiers, providing the same types of fundamental complexity
reductions when applied to large-scale tasks.
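The core sparsity idea can be sketched for Multinomial Naive Bayes: a document is a sparse bag of word counts, so the class score log p(c) + sum over words w of n(w, d) * log p(w | c) is computed by iterating only over the document's nonzero counts, independent of vocabulary size. The toy model parameters and the flat fallback log-probability for unseen words are invented for illustration; the paper's actual methods add smoothing, pruning, and model sparsification on top of this.

```python
import math

def mnb_classify(doc_counts, log_prior, log_likelihood):
    """Return the class with the highest log joint score for a sparse
    document, touching only the document's nonzero word counts."""
    best_class, best_score = None, -math.inf
    for c in log_prior:
        score = log_prior[c]
        for word, count in doc_counts.items():  # sparse iteration
            score += count * log_likelihood[c].get(word, math.log(1e-6))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy two-class model over a tiny vocabulary.
log_prior = {"econ": math.log(0.5), "law": math.log(0.5)}
log_likelihood = {
    "econ": {"market": math.log(0.6), "court": math.log(0.1)},
    "law":  {"market": math.log(0.1), "court": math.log(0.6)},
}
print(mnb_classify({"court": 3, "market": 1}, log_prior, log_likelihood))  # law
```

The per-document cost is O(classes x nonzero terms) rather than O(classes x vocabulary), which is the kind of fundamental complexity reduction the paper builds its multi-label extensions on.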